How to make the most out of disparate palaeo data?

Gavin Simpson

Aarhus University

2025-07-02

What are our data?

What does AI think?

Source: Sora

…but can we trust AI?

Source: Sora

What are our data?

Palaeo data come in many forms

  • counts
  • %
  • biomass
  • concentrations
  • presence or absence (?)
  • presence only?

Where we’ve sampled is almost surely not random: it is a convenience sample

Why is this a problem?

It isn’t

  • in a single site study
  • in a curated study of selected sites

Becomes a problem when we start to collate data into a large database

Why is this a problem?

Even within a single proxy we have inconsistent data

  • raw data may not be available for some locations, only %
  • everyone counts things differently
  • everyone uses their own taxonomy

So much for old news…

No one in this room needs to be told any of this

Source: Sora

What do we want to do with them?

Themes from the website

  • conservation
  • diversity
  • resilience / sensitivity to change
  • ecological interactions
  • co-occurrence among proxies
  • aquatic – terrestrial linkages

How might we do it?

To start the conversation

Traditional methods used in palaeo are unlikely to help with analyses that compare across more than two taxonomic groups

  • coinertia / cocorrespondence analysis — pairs of data

What do we do if we have different resolution data within a proxy? Or different data representations?

Can we use all the data?

Newer methods

Lots of developments in the statistical ecology and omics worlds we can take advantage of

  • integrated SDMs

  • joint species distribution models

  • Model-based ordination

  • Copula models (marginal models for multivariate responses)

Integrated SDMs

Integrated species distribution models

General way to combine — integrate — disparate data

  1. species’ distributions are aggregated spatial locations of all individuals of the same species across a geographical domain

  2. the distribution can be described by a spatial point process, where local intensity (density) of individuals varies

  3. SDMs are a direct or indirect model of this underlying point process

  4. Data integration requires linking each data source to the common underlying point process while accounting for differences among data types

What is a point process?

A spatial point process describes the distribution of event locations across some spatial domain

Random process generating points, described by the local intensity \(\lambda_{s}\)

\(\lambda_{s}\) — expected density of points at spatial location \(s\)

If points occur independently of one another and the number of points in any region follows a Poisson distribution whose mean is set by a constant \(\lambda\), we have a homogeneous Poisson process (\(\lambda_{s} = \lambda \; \forall \; s\))

If \(\lambda_{s}\) varies across \(s\), we have an inhomogeneous Poisson process

Other distributions are available

These work in time as well
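
A minimal sketch of an inhomogeneous Poisson process, simulated by thinning a homogeneous one on the unit square. The intensity surface (`intensity`) and the bound `lam_max` are made up for illustration; they are assumptions, not from any real data set.

```python
import numpy as np

def intensity(x, y):
    """Hypothetical intensity surface: density of points rises from west to east."""
    return 200.0 * x  # expected points per unit area at location (x, y)

def simulate_ipp(intensity_fn, lam_max, rng):
    """Simulate an inhomogeneous Poisson process on the unit square by thinning."""
    # Step 1: homogeneous process with constant intensity lam_max (area = 1)
    n = rng.poisson(lam_max)
    x = rng.uniform(0.0, 1.0, size=n)
    y = rng.uniform(0.0, 1.0, size=n)
    # Step 2: keep each point with probability lambda(s) / lam_max
    keep = rng.uniform(0.0, 1.0, size=n) < intensity_fn(x, y) / lam_max
    return x[keep], y[keep]

rng = np.random.default_rng(1)
x, y = simulate_ipp(intensity, lam_max=200.0, rng=rng)
# Expected number of points is the integral of the intensity: 200 * 1/2 = 100
print(len(x), "points; mean x =", round(float(x.mean()), 2))
```

Thinning needs `lam_max` to be at least as large as the intensity everywhere in the domain; here the maximum of `200 * x` on the unit square is 200.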

Miller et al (2019). Methods Ecol. Evol. 10.1111/2041-210X.13110

Joint likelihood

The different data sets have their own “model” and the likelihoods are combined during fitting

Allows mixing of different types of data

  • pointedSDMs 📦
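
To make the joint likelihood idea concrete, here is a minimal sketch with simulated data. It is not how pointedSDMs works internally. Two data sources, counts and presence/absence, share one log-linear intensity; the covariate `z`, the coefficients in `beta_true`, and the sample sizes are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
beta_true = np.array([-0.5, 1.0])  # hypothetical intercept and slope

def intensity(beta, z):
    """Shared log-linear intensity; linear predictor clipped for numerical safety."""
    return np.exp(np.clip(beta[0] + beta[1] * z, -30.0, 30.0))

# Data source 1: counts at 200 sites (simulated)
z_count = rng.normal(size=200)
y_count = rng.poisson(intensity(beta_true, z_count))

# Data source 2: presence/absence at 300 other sites (simulated)
z_pa = rng.normal(size=300)
y_pa = rng.binomial(1, 1.0 - np.exp(-intensity(beta_true, z_pa)))

def neg_joint_loglik(beta):
    # Poisson log-likelihood for the counts (constant log(y!) term dropped)
    lam_c = intensity(beta, z_count)
    ll_count = np.sum(y_count * np.log(lam_c) - lam_c)
    # Bernoulli log-likelihood for presence/absence: P(present) = 1 - exp(-lambda)
    lam_p = intensity(beta, z_pa)
    p = np.clip(-np.expm1(-lam_p), 1e-12, 1.0)
    ll_pa = np.sum(y_pa * np.log(p) - (1 - y_pa) * lam_p)
    # The two log-likelihoods are simply added: both inform the same beta
    return -(ll_count + ll_pa)

fit = minimize(neg_joint_loglik, x0=np.zeros(2), method="BFGS")
print("estimated (intercept, slope):", np.round(fit.x, 2))
```

Each data type keeps its own observation model, but both are functions of the same underlying intensity, which is what lets the presence/absence sites sharpen the estimates from the count sites.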

Similar idea to combine likelihoods from different types of data

  • Jim Clark’s Generalized Joint Attribute Model (GJAM) in gjam 📦
  • the gfam() family in mgcv 📦

Miller et al (2019). Methods Ecol. Evol. 10.1111/2041-210X.13110

JSDMs

Instead of modelling one species at a time and stacking the models, Joint Species Distribution Models estimate all species at once

Ideally we’d combine integrated SDMs with JSDMs but as yet, I’m not aware of anything

JSDMs can be used to fit model-based ordinations; we might have to move away from traditional ordination methods to handle features of the data properly

  • gllvm
  • ecoCopula
  • boral
  • mvgam

Diversity

Any modelling of “diversity” needs to handle the sediment accumulation problem

Samples that average over different amounts of time lead to

  • heteroscedasticity
  • different effort — biases species richness etc

Same problem affects any modelling of any palaeo data, save for annually laminated records…

Diversity

Rare or data-deficient species?

Large training sets — throw out rare species, singletons etc

eDNA — “filtering” throws away a lot of data (& please don’t rarefy to counts)

Hierarchical models involving random “effects” allow us to borrow strength from more data-rich taxa
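
A rough sketch of what "borrowing strength" means, using simple partial pooling of per-taxon means. The numbers, and the assumption that the within-taxon variance `sigma2` and among-taxon variance `tau2` are known, are purely illustrative; a real hierarchical model would estimate them from the data.

```python
import numpy as np

def partial_pool(means, n_obs, sigma2, tau2):
    """Shrink each taxon's mean towards the community mean (normal-normal model)."""
    means = np.asarray(means, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)
    # Precision-weighted estimate of the community-level mean
    precision = 1.0 / (tau2 + sigma2 / n_obs)
    mu = np.sum(precision * means) / np.sum(precision)
    # Weight on a taxon's own data: close to 1 when it has many observations
    weight = tau2 / (tau2 + sigma2 / n_obs)
    return weight * means + (1.0 - weight) * mu, mu

# Hypothetical taxa: two data-rich, one seen only twice
raw_means = [2.0, 2.2, 5.0]
n_obs = [200, 150, 2]
pooled, mu = partial_pool(raw_means, n_obs, sigma2=4.0, tau2=1.0)
print("community mean:", round(float(mu), 2))
print("pooled means:", np.round(pooled, 2))
```

The data-rich taxa barely move, while the estimate for the rare taxon is pulled strongly towards the community mean rather than being thrown away.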

Sharma et al (in press). No species left behind: borrowing strength to map data-deficient species. Trends Ecol. Evol. 10.1016/j.tree.2025.04.010

What if we can’t?

If we can’t / don’t want to use these newer methods, what can we do with dissimilarities?

Fused dissimilarities

  • compute dissimilarity among samples for a single proxy / type of data separately
  • compute the fused dissimilarity \[d_{\text{fused}_{jk}} = w d_{x_{jk}} + (1 - w)d_{y_{jk}}\]
  • extends to \(\mathcal{N}\) different data sets \[d_{\text{fused}_{jk}} = \sum_{i = 1}^{\mathcal{N}} w_i d_{i_{jk}}, \;\; \text{where} \sum_{i=1}^{\mathcal{N}} w_i = 1\]

Then analyse using NMDS or db-RDA, etc.
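
The fusion step above can be sketched in a few lines. The two count matrices (same four samples, two hypothetical proxies), the choice of Bray-Curtis, and the equal weights are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def fused_dissimilarity(dissims, weights):
    """Weighted sum of dissimilarity matrices; weights must be >= 0 and sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if len(dissims) != len(weights):
        raise ValueError("need one weight per dissimilarity matrix")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * d for w, d in zip(weights, dissims))

# Hypothetical data: the same 4 samples (rows) counted for two proxies
diatoms = np.array([[10, 0, 5], [8, 2, 5], [0, 12, 3], [1, 10, 4]], dtype=float)
chironomids = np.array([[3, 7], [4, 6], [9, 1], [5, 5]], dtype=float)

# Dissimilarity computed separately for each proxy
d_x = squareform(pdist(diatoms, metric="braycurtis"))
d_y = squareform(pdist(chironomids, metric="braycurtis"))

# Fuse with w = 0.5 for each proxy
d_fused = fused_dissimilarity([d_x, d_y], [0.5, 0.5])
print(np.round(d_fused, 3))
```

Bray-Curtis is bounded between 0 and 1 for count data, so the two matrices are on a comparable scale before weighting; with unbounded dissimilarities some rescaling would probably be needed first.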

Omics

Over in the Omics cinematic universe, those folks are doing their own thing integrating disparate kinds of data

Popular techniques are focused around extensions to PLS

Multiple different types of omics analysis on the same samples

Change of support

What if we don’t have the same proxies measured at the same set of sites? — spatial misalignment

What if proxies represent different amounts of space (time)?

This is covered under the problem of change of support and the concept of data fusion

Source: Giphy

Transfer functions

Remember Steve Juggins’ “sick science” warning

The future?

Palaeopen 2.0

In hindsight, palaeoecologists could have been doing things very differently 50 or 100 years ago, which would’ve been really useful to us now

How would we change the field today to make our future lives better when Palaeopen 2.0 comes around?

Extinction

Very hard to say diatom species x went extinct from this lake at this time

Most palaeo data are presence only

Possibly with associated marks — abundance or biomass conditional upon the taxon being found

We don’t know things about the taxa we don’t find

Hard to put a probability on (e.g.) extinction with these data

Repeat counts

But ecologists have been doing this kind of work for decades — occupancy modelling

Most methods require repeated sampling

What would that look like for palaeo?

Could we count the same number of things but over \(n \geq 2\) different “samples”?
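
Why repeats matter, as a small sketch based on the standard occupancy-model logic: if a taxon is present with probability `psi` and detected in any one replicate with probability `p`, then \(n\) replicates with no detections leave a probability of \(\psi(1-p)^n / (\psi(1-p)^n + 1 - \psi)\) that it was there all along. The values of `psi` and `p` below are hypothetical; in practice both would be estimated from the repeated counts.

```python
def prob_present_given_no_detection(psi, p, n):
    """Probability a taxon is present despite n replicate samples without detection."""
    missed = psi * (1.0 - p) ** n        # present, but missed in every replicate
    return missed / (missed + (1.0 - psi))  # divided by P(no detections at all)

# Hypothetical values: 50% prior occupancy, 40% detection per replicate
for n in (1, 2, 5):
    print(n, round(prob_present_given_no_detection(0.5, 0.4, n), 3))
```

With a single sample, absence of evidence is weak evidence of absence; each extra replicate lowers the chance that a taxon we did not find was in fact present.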

What would you change?

As we progress through Palaeopen, think about

  • what “future you” would’ve liked palaeoecologists of the past to have done

  • how would that change our field?

  • how do we achieve that?

Thank you